{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "UMqBL77hMXP2"
      },
      "source": [
        "# Text preprocessing pipeline\n",
        "Authors:  \n",
        " - [Lior Gazit](https://www.linkedin.com/in/liorgazit).  \n",
        " - [Meysam Ghaffari](https://www.linkedin.com/in/meysam-ghaffari-ph-d-a2553088/).  \n",
        "\n",
        "This notebook is taught and reviewed in our book:  \n",
        "**[Mastering NLP from Foundations to LLMs](https://www.amazon.com/dp/1804619183)**  \n",
        "![image.png]()\n",
        "\n",
        "This Colab notebook is referenced in our book's Github repo:   \n",
        "https://github.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs   \n",
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter4_notebooks/Ch4_Preprocessing_Pipeline.ipynb\">\n",
        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
        "</a>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "dO3hrUVTRoMN"
      },
      "source": [
        "**The purpose of this notebook:**  \n",
        "As demonstrated in Chapter 4 of the book, text preprocessing is one of the most fundamental practices of NLP.  \n",
        "In this notebook we walk you through a variety of preprocessing functions and show how they come together to a solid pipeline.   \n",
        "\n",
        "**Requirements:**  \n",
        "* When running in Colab, use this runtime notebook setting: `Python 3, CPU`  \n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "g54Uf66Vz9Fi"
      },
      "source": [
        ">*```Disclaimer: The content and ideas presented in this notebook are solely those of the authors and do not represent the views or intellectual property of the authors' employers.```*"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "ayuVflRZ5w9C"
      },
      "source": [
        "Install:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DhrOK1E1lxTo",
        "outputId": "92ab8692-a3eb-4dfe-e82f-2b82d6bae042"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: num2words in /usr/local/lib/python3.10/dist-packages (0.5.13)\n",
            "Requirement already satisfied: autocorrect in /usr/local/lib/python3.10/dist-packages (2.6.1)\n",
            "Requirement already satisfied: docopt>=0.6.2 in /usr/local/lib/python3.10/dist-packages (from num2words) (0.6.2)\n"
          ]
        }
      ],
      "source": [
        "# REMARK:\n",
        "# If the below code error's out due to a Python package discrepency, it may be because new versions are causing it.\n",
        "# In which case, set \"default_installations\" to False to revert to the original image:\n",
        "default_installations = True\n",
        "if default_installations:\n",
        "  !pip install num2words autocorrect\n",
        "else:\n",
        "  import requests\n",
        "  text_file_path = \"requirements__Ch4_Preprocessing_Pipeline.txt\"\n",
        "  url = \"https://raw.githubusercontent.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/main/Chapter4_notebooks/\" + text_file_path\n",
        "  res = requests.get(url)\n",
        "  with open(text_file_path, \"w\") as f:\n",
        "    f.write(res.text)\n",
        "\n",
        "  !pip install -r requirements__Ch4_Preprocessing_Pipeline.txt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3M3KTFuMSgCf",
        "outputId": "90ed8f76-68e3-4211-9d21-b730f1ce864f"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n",
            "[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
            "[nltk_data]   Package stopwords is already up-to-date!\n",
            "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
            "[nltk_data]   Package wordnet is already up-to-date!\n"
          ]
        }
      ],
      "source": [
        "# Imports:\n",
        "import re\n",
        "from num2words import num2words\n",
        "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')\n",
        "from nltk.corpus import stopwords\n",
        "from nltk.stem.porter import PorterStemmer\n",
        "from nltk.stem import WordNetLemmatizer\n",
        "from autocorrect import Speller\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "id": "ivh-TwcqSnVd"
      },
      "outputs": [],
      "source": [
        "# Preprocessing functions:\n",
        "def decode(text):\n",
        "  \"\"\"\n",
        "  The function takes in a string of text as input\n",
        "  and extracts the subject line and body text from the text\n",
        "  using regular expressions. It then formats the extracted\n",
        "  text into a single string and returns it as output.\n",
        "\n",
        "  Input: str\n",
        "  Output: str\n",
        "  \"\"\"\n",
        "  text = re.sub(\"\\\\n|\\\\r|\\\\t|-\", \" \", text)\n",
        "  subject_line_search = re.search(r\"<SUBJECT LINE>(.*?)<END>\", text, flags=re.S)\n",
        "  body_text_search = re.search(r\"<BODY TEXT>(.*?)<END>\", text, flags=re.S)\n",
        "\n",
        "  formated_output = \"\"\n",
        "  if subject_line_search:\n",
        "    formated_output = formated_output + subject_line_search.groups()[0] + \". \"\n",
        "  if body_text_search:\n",
        "    formated_output = formated_output + body_text_search.groups()[0] + \".\"\n",
        "  return formated_output\n",
        "\n",
        "\n",
        "def digits_to_words(match):\n",
        "  \"\"\"\n",
        "  Convert string digits to the English words. The function distinguishes between\n",
        "  cardinal and ordinal.\n",
        "  E.g. \"2\" becomes \"two\", while \"2nd\" becomes \"second\"\n",
        "\n",
        "  Input: str\n",
        "  Output: str\n",
        "  \"\"\"\n",
        "  suffixes = ['st', 'nd', 'rd', 'th']\n",
        "  # Making sure it's lower cased so not to rely on previous possible actions:\n",
        "  string = match[0].lower()\n",
        "  if string[-2:] in suffixes:\n",
        "    type='ordinal'\n",
        "    string = string[:-2]\n",
        "  else:\n",
        "    type='cardinal'\n",
        "\n",
        "  return num2words(string, to=type)\n",
        "\n",
        "\n",
        "def spelling_correction(text):\n",
        "    \"\"\"\n",
        "    Replace misspelled words with the correct spelling.\n",
        "\n",
        "    Input: str\n",
        "    Output: str\n",
        "    \"\"\"\n",
        "    corrector = Speller()\n",
        "    spells = [corrector(word) for word in text.split()]\n",
        "    return \" \".join(spells)\n",
        "\n",
        "\n",
        "def remove_stop_words(text):\n",
        "    \"\"\"\n",
        "    Remove stopwords.\n",
        "\n",
        "    Input: str\n",
        "    Output: str\n",
        "    \"\"\"\n",
        "    stopwords_set = set(stopwords.words('english'))\n",
        "    return \" \".join([word for word in text.split() if word not in stopwords_set])\n",
        "\n",
        "\n",
        "def stemming(text):\n",
        "    \"\"\"\n",
        "    Perform stemming of each word individually.\n",
        "\n",
        "    Input: str\n",
        "    Output: str\n",
        "    \"\"\"\n",
        "    stemmer = PorterStemmer()\n",
        "    return \" \".join([stemmer.stem(word) for word in text.split()])\n",
        "\n",
        "\n",
        "def lemmatizing(text):\n",
        "    \"\"\"\n",
        "    Perform lemmatization for each word individually.\n",
        "\n",
        "    Input: str\n",
        "    Output: str\n",
        "    \"\"\"\n",
        "    lemmatizer = WordNetLemmatizer()\n",
        "    return \" \".join([lemmatizer.lemmatize(word) for word in text.split()])\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "U7wnjrxqS0Ub"
      },
      "outputs": [],
      "source": [
        "# Preprocessing pipeline:\n",
        "def preprocessing(input_text, printing=False):\n",
        "  \"\"\"\n",
        "  This function represents a complete pipeline for text preprocessing.\n",
        "\n",
        "  Code design note: The fact that we update variable \"output\" instead of\n",
        "  creating new variables with new names as we go, allows us to change the\n",
        "  order of the actions or add/remove actions easily.\n",
        "\n",
        "  Input: str\n",
        "  Output: str\n",
        "  \"\"\"\n",
        "  output = input_text\n",
        "  # Decode/remove encoding:\n",
        "  output = decode(output)\n",
        "  print(\"\\nDecode/remove encoding:\\n        \", output)\n",
        "\n",
        "  # Lower casing:\n",
        "  output = output.lower()\n",
        "  print(\"\\nLower casing:\\n        \", output)\n",
        "\n",
        "  # Convert digits to words:\n",
        "  # The following regex syntax looks for matching of consequtive digits tentatively followed by an ordinal suffix:\n",
        "  output = re.sub(r'\\d+(st)?(nd)?(rd)?(th)?', digits_to_words, output, flags=re.IGNORECASE)\n",
        "  print(\"\\nDigits to words\\n        \", output)\n",
        "\n",
        "  # Remove punctuations and other special characters:\n",
        "  output = re.sub('[^ A-Za-z0-9]+', '', output)\n",
        "  print(\"\\nRemove punctuations and other special characters\\n        \", output)\n",
        "\n",
        "  # Spelling corrections:\n",
        "  output = spelling_correction(output)\n",
        "  print(\"\\nSpelling corrections:\\n        \", output)\n",
        "\n",
        "\n",
        "  # Remove stop words:\n",
        "  output = remove_stop_words(output)\n",
        "  print(\"\\nRemove stop words:\\n        \", output)\n",
        "\n",
        "  # Stemming:\n",
        "  output = stemming(output)\n",
        "  print(\"\\nStemming:\\n        \", output)\n",
        "\n",
        "  # Lemmatizing:\n",
        "  output = lemmatizing(output)\n",
        "  print(\"\\nLemmatizing:\\n        \", output)\n",
        "\n",
        "  return output"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2Duu0SyeS2jW",
        "outputId": "619a7797-6e09-4e2a-a3a3-f2d61968c906"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "This is the input raw text:\n",
            "\n",
            "\"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,\n",
            "1st one is pairoll, 2nd is healtcare!<END>\"\n",
            "\n",
            "\n",
            "Decode/remove encoding:\n",
            "          Employees details. Attached are 2 files, 1st one is pairoll, 2nd is healtcare!.\n",
            "\n",
            "Lower casing:\n",
            "          employees details. attached are 2 files, 1st one is pairoll, 2nd is healtcare!.\n",
            "\n",
            "Digits to words\n",
            "          employees details. attached are two files, first one is pairoll, second is healtcare!.\n",
            "\n",
            "Remove punctuations and other special characters\n",
            "          employees details attached are two files first one is pairoll second is healtcare\n",
            "\n",
            "Spelling corrections:\n",
            "         employees details attached are two files first one is payroll second is healthcare\n",
            "\n",
            "Remove stop words:\n",
            "         employees details attached two files first one payroll second healthcare\n",
            "\n",
            "Stemming:\n",
            "         employe detail attach two file first one payrol second healthcar\n",
            "\n",
            "Lemmatizing:\n",
            "         employe detail attach two file first one payrol second healthcar\n",
            "\n",
            "----------------------------\n",
            "This is the preprocessed text:\n",
            "        employe detail attach two file first one payrol second healthcar\n"
          ]
        }
      ],
      "source": [
        "# Applying preprocessing:\n",
        "raw_text_input = \"\"\"\n",
        "\"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,\\n1st one is pairoll, 2nd is healtcare!<END>\"\n",
        "\"\"\"\n",
        "print(f\"This is the input raw text:\\n{raw_text_input}\")\n",
        "\n",
        "print(f\"\\n----------------------------\\nThis is the preprocessed text:\\n        {preprocessing(raw_text_input, printing=True)}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "pMKB4Uk8kv46"
      },
      "outputs": [],
      "source": [
        "\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
